ABSTRACT

This paper analyzes the 19 studies presented to the 
National Institute of Education's (NIE) panel on the effects of 
school desegregation on black achievement and discusses the author's 
own findings. The author concludes that desegregation did not cause 
any decrease in black achievement generally, nor did it cause any 
increase in math achievement. Although desegregation increased mean 
reading levels, the distribution of reading effects appeared to be 
skewed, with a disproportionate number of school districts obtaining 
atypically high gains. Studies with the largest gains were 
characterized along a number of methodological and substantive 
dimensions (none of which could be isolated as causes of the 
atypically high reading gains) including: small sample size, two or 
more years of desegregation, desegregated children who outperformed 
their segregated counterparts even before desegregation began, and 
desegregation that occurred earlier, was voluntary, occurred in 
schools with larger percentages of whites, and was associated with 
enrichment programs. Because of the small samples in the NIE project, 
and the apparently non-normal distributions, the author states he is 
not confident that anything has been learned about desegregation's 
effect on reading on the average. Across the few studies examined
he found that variability in effect sizes was more striking and less 
well understood than any measure of central tendency. The paper ends 
with a review of the implications of the findings for various 
interest groups and a summary of the implications the NIE project has 
for theories of research synthesis. (CMG) 
INTRODUCTION



My assignment is to comment on the foregoing essays by Armor, Crain,
Miller, Stephan, Walberg and Wortman in order to help readers decide what should
be concluded from prior evaluations of how school desegregation has affected the
academic achievement of black children. All but two of the essays contain a
metaanalysis by the author. Crain's paper is one of the exceptions. Instead of
conducting a metaanalysis, he critically discusses some of the assumptions
behind the others' efforts and concludes that he will stand by the results of
his own prior metaanalytic work (Crain & Mahard, 1983). I shall refer to his
prior metaanalysis based on 93 studies more than to his essay in this volume.
Walberg is the other exception. He devotes most of his essay to a review of
factors other than desegregation that raise academic achievement. He does this
to make the point that, if the purpose of desegregation is to raise the
achievement of black children, then more effective means exist to do this than
desegregation. Walberg does, however, reanalyze three prior metaanalyses, by
Krol (1975), Crain & Mahard (1982) and Wortman, King and Bryant (1982), in order
to make the further point that, in his estimation, the average effect sizes they
present do not reliably differ from zero. I intend to deal with his statistical
analysis to a small extent, but will not deal directly with his larger point
about relative efficacy.

The first part of the present paper deals with the metaanalytic work of
Armor, Miller, Stephan and Wortman, and is largely restricted to the 19 studies
selected by the panel. The purpose is to arrive at an estimate for this sample
of how desegregation has affected the achievement of black children. I try to
restrict my commentary to the most important points and assumptions made by the
authors, and make no attempt at a comprehensive analysis of the strengths and
weaknesses of any single person's work. This is to keep the focus on the
desegregation issue. In the second part of the paper I take my own results,
which are both similar to and different from those of the panel, and discuss
several ways they can be interpreted. In particular, I ask how generalizable
the results from the panel's 19 studies are when they are compared to the
results from larger data bases; I probe the extent to which my findings speak
to the information needs of groups with different stakes in school
desegregation; and I speculate about whose interests the panel's results might
advance or prejudice.

RESULTS 

1. The Studies Examined. Individual panel members considered different subsets
of the 19 studies that most of them deemed methodologically adequate. Armor
dropped the study by Rentsch on grounds, first, that the desegregated group and
the segregated controls differed by so much initially; second, that the pretests
and posttests involved different measures; and third, that the desegregated
control group contained some white children. He also dropped the study by
Thompson & Smidchens on grounds that the segregated controls were in classes
made up of only 42% minority students. However, he included the study by
Carrigan, even though its segregated control group members were in classes that
were hardly more "segregated" (50% minority). Indeed, Miller and Stephan dropped
the Carrigan study because of its questionable control group. In a few other
cases, Armor selected control groups within a study that differed from the
choice of all other panelists. The net result of Armor's preferences was lower
effect sizes since (1) Rentsch obtained some of the largest effect sizes, (2)
Carrigan resulted in both positive and negative effect sizes, and (3) both
Rentsch and Carrigan involved multiple comparisons and so their results were
disproportionately weighted whenever comparisons were the unit of analysis
rather than individual studies.

Miller dropped both Carrigan and Thompson & Smidchens from his analyses
because the segregated controls were not segregated. He also differed from the
other analysts in preferring to compute an effect size per study instead of per
comparison. Much has been written in the metaanalysis literature on this topic,
and our preference is to compute or report effect sizes each way. However, if
only one choice is available, we favor a sample of studies because this does not
weight the results in favor of school districts where desegregation was tested
using several grades.

Stephan also omitted the studies by Carrigan and by Thompson & Smidchens.
However, he also objected to the studies by Iwanicki & Gable and by Slone on
grounds that they dealt with the second year of desegregation while other
studies dealt with the first year. He further objected to Slone because the
segregated controls were attending a school that was 40% white. This left
Stephan with only 15 studies to analyze. Since the studies he omitted all
tended, with the exception of Slone, to have zero or negative effect size
estimates, it is clear that Stephan's sampling decision disposed his analysis
towards a larger average effect size than other panelists.

Wortman differed from the other panelists in two important ways. First, he
preferred his own selection of 31 "superior" studies to the panel's 19.
However, his analysis of the 31 showed that designs without control groups
produced higher effect size estimates than designs with control groups. Hence,
I treat his analyses based on studies with controls differently from the
analyses without controls, for, among other possible artifacts, maturation and
testing effects can inflate estimates of the desegregation effect. Second, in
his analyses of the panel's 19 studies, Wortman was more strict than the others
about what he would accept as valid information about variances. Since such
information is crucial for computing effect sizes, he was able to produce
estimates that also controlled for pretest differences between the desegregated
and segregated control groups for only 11 of the 19 studies favored by the
panel. One of these was the study by Carrigan. Omitted were Clark, Evans,
Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and Thompson &
Smidchens. Since Wortman preferred somewhat different standards of
methodological adequacy than the panel, I sometimes include estimates computed
from his analyses of the 11 panel studies, and estimates based on the larger
subset of his preferred studies that involved designs with control groups.
These studies should overlap heavily with the panel's selection criteria.

The panelists provided estimates for reading and math combined, for reading
alone, and for math alone. It is interesting to note that there is no obvious
relationship between gains in mathematics and reading when the desegregated are
compared to the segregated. To compute a correlation of reading and math gains
would not be useful because of the small number of studies and comparisons for
which there were measures of both reading and mathematics gains. However, of
Armor's 18 relevant comparisons, math and reading gains had the same sign in
seven instances, different signs in eight, and three instances were
indeterminate because of zeros. Of Miller's 13 comparisons, seven had the same
sign and six the opposite; while of Stephan's comparisons there were 13 with the
same sign, 11 with the opposite, and one was indeterminate. Math and reading
gains were not clearly related, and little is gained by adding them together.
Consequently, I prefer to present results separately for each knowledge domain.
However, for purposes of continuity with the panelists some of my reanalyses
will involve reading and math scores combined. When that happens, my
analyses, like those of the panelists, weight reading slightly more than math
because more reports included reading than math measures.
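
The sign tallies quoted above can be summarized in a short sketch. The counts
come from the text; the function name and the choice to drop indeterminate
comparisons from the denominator are illustrative assumptions, not a procedure
any panelist used.

```python
# Sign agreement between reading and math gains, using the counts in the text.
sign_counts = {
    "Armor": {"same": 7, "different": 8, "indeterminate": 3},
    "Miller": {"same": 7, "different": 6, "indeterminate": 0},
    "Stephan": {"same": 13, "different": 11, "indeterminate": 1},
}


def same_sign_share(counts):
    """Share of determinate comparisons where math and reading agree in sign.

    Assumption: comparisons with a zero gain are left out of the denominator.
    """
    determinate = counts["same"] + counts["different"]
    return counts["same"] / determinate


for analyst, counts in sign_counts.items():
    print(f"{analyst}: {same_sign_share(counts):.2f} of determinate signs agree")
```

A share near one half for every analyst is what unrelated reading and math
gains would tend to produce, which is the point made in the paragraph above.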

2. Panelists' Results. Using his own preferred set of studies based on a
sample of comparisons, Armor obtained an effect size of .06 for reading and .01
for math; Miller obtained an effect size of .16 for reading and .08 for math;
Stephan's values were .15 and .00; while in my analysis of Wortman's results for
the eleven studies with pretest adjustments, the mean effects were .26 and .08.
(Wortman's own results from the panel's 19 studies were .28 and .23, but this
includes studies where no pretest adjustments were made. His estimates from his
total sample of 31 studies were .57 and .33, but these are based on some studies
without control groups. Thus, I consider both of these last sets of estimates
to be problematic.)

If we turn now to estimates of reading and math combined, Armor's overall
estimate was .04, Stephan's was .14 (but .67 when computed as gain per
eight-month school year), Miller's was .12, while Wortman's was .17 derived
from the studies of his own choosing that had control groups.

If one took the panel's estimates at face value they would appear to
support the following conclusions:

1. Desegregation did not cause a decrease in the achievement of black
children.

2. It probably did not cause an increase in math skills, for the mean
gains vary from 0 to .08 standard deviation units. 1

3. It may have caused an increase in reading skills, for the mean gains
vary from .06 to .26.

The range estimate for reading deserves comment, since the upper bound
comes from our analysis of Wortman's eleven studies where pretest adjustments
could be made. This is a considerably smaller sample than the other authors
analyzed, and so should be treated as particularly tentative. Omitting it gives
a revised range that permits a fourth conclusion, which I believe to be better
justified than the third conclusion immediately above:

4. The gain in reading was somewhere between .06 and .16 standard
deviation units. This is between two and six weeks of gain if we follow the
rule of thumb of Glass et al. (1981) and associate a gain of one-tenth of a
standard deviation with one month's gain in knowledge.
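
The arithmetic behind the fourth conclusion can be sketched as follows. The
rule of thumb (one-tenth of a standard deviation per month) is taken from the
text; treating a month as four weeks, and the function name, are assumptions
made only for illustration.

```python
WEEKS_PER_MONTH = 4  # assumption: one month of schooling counted as four weeks


def sd_to_weeks(effect_size, sd_per_month=0.10, weeks_per_month=WEEKS_PER_MONTH):
    """Convert an effect size in SD units to weeks of gain.

    Uses the rule of thumb quoted in the text: 0.10 SD corresponds to
    roughly one month's gain in knowledge.
    """
    return effect_size / sd_per_month * weeks_per_month


low, high = sd_to_weeks(0.06), sd_to_weeks(0.16)
print(f"reading gain: about {low:.1f} to {high:.1f} weeks")
```

Under these assumptions the bounds fall a little above two and a little above
six weeks, consistent with the rounded range given in the conclusion.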

The small discrepancies between the panelists in mean estimates principally
reflect differences in (1) the studies included for review; (2) the way effect
sizes were computed; and (3) a preference for some types of control groups over
others within a few studies. I shall resist the temptation to discuss each of
these issues in order to make judgments for each of them about the
methodological option to be preferred, after which point estimates of gains
could be computed. While such an exercise would result in easily remembered
single number estimates of reading and math gain, the resulting precision would
be misplaced. In metaanalysis, varying the assumptions underlying an analysis
is desirable because it makes heterogeneous those facets of research where no
"right" answer is available and fallible human judgment is required. To attempt
to legislate a single "right" way either to compute effect sizes or to sample
studies would be counterproductive so long as none of the analysts is clearly
wrong. Indeed, the idea of selecting a panel of methodologically sophisticated
experts with different views on school desegregation is predicated on the
particular utility that would result if the panel's estimates of desegregation's
effects converged despite the differences in values and methodological
predilections of individual panelists. It is more reasonable to expect
"convergence" as a range than a point. To search for the elusive "true" point
estimate of effect could involve laborious debates about fine points of
methodology and substance that might occur within a range of estimates that many
would think has few practical implications.

Speaking personally, I am impressed by the degree of correspondence between
the panelists when only the 19 core studies are considered. None achieves
negative estimates; all achieve larger estimates for reading than math; and the
largest single difference (between Armor and Miller for reading gains) is of a
magnitude many would consider small, viz., a difference of about one month of
gain.

The convergence is all the more dramatic since, across all dependent
variables, Krol obtained an estimate of .10 from his own metaanalysis of
"better" desegregation studies, while a similar estimate resulted from Crain &
Mahard (1983) when one aggregates across all their dependent variables for the
randomized experiments and studies with both pretest-posttest measurement and
control groups of segregated black children. Combining math and reading and
analyzing only the studies preferred by the present panelists, Armor's estimate
was .04, Miller's was .12, and Wortman's was .17 for all the studies he found
with pretests and black control groups, while Stephan's estimate was .14 without
his correction for the length of time desegregation had been taking place, a
correction that none of the other panelists made. The average of the panelists'
values is .11, only slightly higher than the estimate obtained by Krol and Crain
& Mahard. (However, as we later see, Crain rejects this estimate, preferring to
base his judgment on studies where desegregation occurs at kindergarten or first
grade.)

3. The Distribution Problem. As a measure of central tendency, the mean depends
on a normal distribution of scores. In Figures 1 through 4 we present frequency
distributions of reading effect sizes for Armor, Miller, Stephan, and Wortman
based on the studies they chose to analyze. (For Wortman we add the math data
since he presents reading effect sizes for only eleven studies where pretest
adjustments were made, and this results in a particularly poor estimate of the
distribution.) In all cases except Miller the sample sizes are based on
comparisons rather than studies. But irrespective of the unit of analysis, the
distributions are visibly skewed, with a disproportionate number of effect sizes
falling in the upper range.

Table 1 presents the medians and modes corresponding to the reading mean.
The median is computed for a sample of both comparisons and studies and is
defined as the value of the (N+1)/2th case. To compute a mode with so few
cases, we constructed a scale composed of categories with intervals of .10
standard deviation units whose midpoints are presented in Figures 1-4. Each
effect size was assigned to its respective category, with scores of zero being
assigned in equal proportions to the categories 0 to +.10 and 0 to -.10. For
Miller, no value is reported for the median of comparisons since he only
provided data on studies. Sometimes, no mode is presented for Wortman because
his smaller sample of studies with pretest adjustments often makes it difficult
to determine any modal category with more than three cases falling into it.

Table 1 shows that mean effect sizes for reading are larger than median
effect sizes irrespective of whether the latter are computed as a median of
comparisons or of studies. It also shows that the mode is smaller than the
other measures of central tendency and hovers around zero. Indeed, the mean of
the mean effect sizes across all four panelists is .15, the mean median of
comparisons is .08, the mean median of studies is .05, while the modal
categories are of effects between +.05 and -.05!
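
The median and modal-category rules described above can be sketched in a few
lines. The effect sizes below are hypothetical values chosen only to show a
right-skewed distribution; they are not taken from any panelist's tables, and
the function names and the handling of values that fall exactly on a category
boundary are assumptions.

```python
import math
from collections import Counter


def median_case(values):
    """Median as the value of the (N+1)/2th ordered case.

    When N is even, the two middle cases are averaged (an assumption; the
    text does not say how a fractional case number was handled).
    """
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2
    lower, upper = math.floor(position), math.ceil(position)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


def modal_midpoint(values, width_hundredths=10):
    """Midpoint of the most populated .10-wide category.

    Zeros are split equally between the categories 0 to +.10 and 0 to -.10,
    as in the text. Values are binned in integer hundredths to avoid
    floating-point boundary problems.
    """
    counts = Counter()
    for value in values:
        if value == 0:
            counts[0] += 0.5   # category 0 to +.10
            counts[-1] += 0.5  # category -.10 to 0
        else:
            counts[round(value * 100) // width_hundredths] += 1
    best_bin = max(counts, key=counts.get)
    return (best_bin + 0.5) * width_hundredths / 100


# Hypothetical, right-skewed effect sizes (illustration only).
effects = [-0.17, -0.04, 0.0, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25, 0.40, 0.70]
mean = sum(effects) / len(effects)
print(f"mean={mean:.2f} median={median_case(effects):.2f} "
      f"modal midpoint={modal_midpoint(effects):.2f}")
```

With a few large positive values in the tail, the mean sits above the median
and the modal category stays near zero, which is the ordering Table 1 reports.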




Insert Figures 1 through 4 about here 



Table 1 was recomputed based on the 17 core studies most panelists agreed
upon. That is, Thompson & Smidchens was omitted since three of the four
panelists who did metaanalyses questioned it; and Carrigan was omitted since at
least two of the panelists objected to the questionable nature of its
"segregated" controls. In computing the data for Armor, the missing values for
Rentsch were taken from Wortman. Stephan provided his own estimates for the
studies by Iwanicki & Gable and Slone that he preferred to leave out of most of
his own analyses. As Table 2 shows, having a common set of studies reduced the
dispersion of mean effect sizes for reading. The range for the panelists
shifted from .06-.16 to .13-.16. (Wortman is excepted because his analysis is
not based on the 17 studies, and I did not want to take the six missing
estimates from other panelists since that would involve estimating about 30% of
the scores.) However, even with the same 17 studies per analyst the table still
shows that medians are lower than the means, and that modes are lower than
medians.



Insert Tables 1 and 2 about here 



A corresponding table for math from the authors' own preferred set of
studies is in Table 3. Modes could not reasonably be computed due to the
smaller number of math than reading comparisons. However, the means are
consistently higher than the medians.

Combining math and reading allows modes to be computed again and results in
the same basic relationship between measures of central tendency. This is true
whether one uses the authors' own set of preferred studies (Table 4) or the
common set of 17 (Table 5). The individually preferred studies produce a range
of mean estimates from .06 to .16, of median estimates from .00 to .08, and of
mode estimates from -.15 to +.05.



Insert Tables 3, 4, and 5 about here 
These differences in central tendency result because the distribution of
effect sizes is skewed. The skewness means that, if one were willing to assume
that the present results are applicable to the nation at large today (a
dangerous assumption), then (1) for any school district that desegregates the
most reasonable expectation is that there will be no effects on black
achievement, for the mode suggests that this outcome is obtained more often than
any other; (2) 50% of the school districts will probably raise achievement by
about three one-hundredths of a standard deviation (the average median of
studies across the panelists), while 50% of them will probably raise it by less
than this; but (3) the national impact will be to raise the achievement of black
children in reading by between two and six weeks and to raise achievement in
math, if at all, by something less than three weeks, the upper range of
mean estimates. However, (4) a minority of school districts could expect to
make larger positive gains. Using Miller's reading estimates for the moment,
larger gains appear to have been obtained by Anderson (.733), Beker (.400),
Syracuse (.691), and Zdep (.671). In mathematics, the outliers were less common
but still visible (Anderson .669, Klein .333, and Van Every .543).

But Stephan's estimates make the studies with outlying results seem less
extreme and some different outliers emerge. He computes effect sizes in a way
that controls for the length of time children have been under study in a
desegregated school. When reading effect sizes are computed per eight-month
school year, the outliers are pulled in because they tended to come from studies
lasting two or three years. The new values are: Anderson (.42), Beker (.13),
and Zdep (.66). (Stephan leaves Syracuse out of his sample.) For mathematics,
the positive outliers now become: Anderson (.24), Klein (.33), and Van Every
(.14). Stephan's computation of effect sizes leads to less variable and less
skewed estimates than the other panelists, which is why medians and modes make
less of a difference to his computations of central tendency than to others.
But the choice of a measure of central tendency still makes a difference in
Stephan's estimates, for both reading and reading and math combined.
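
A rough sketch of how a per-school-year adjustment pulls in effects from longer
studies is given below. It assumes simple proportional scaling by the number of
eight-month school years, which may not be exactly how Stephan computed his
values; the function name and the example duration are hypothetical.

```python
def effect_per_school_year(effect_size, months_desegregated, months_per_year=8):
    """Scale a cumulative effect size to a single eight-month school year.

    Assumption: the effect accrues evenly over time, so the cumulative
    effect is divided by the number of school years elapsed.
    """
    school_years = months_desegregated / months_per_year
    return effect_size / school_years


# Hypothetical case: a cumulative effect of .40 observed after three
# eight-month school years (24 months) of desegregation.
print(round(effect_per_school_year(0.40, 24), 2))
```

Under this assumption a study lasting two or three years has its effect cut to
a half or a third, while a one-year study is left unchanged, so long-running
outliers move toward the rest of the distribution.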

However, Stephan's work does present a puzzle. He is the sole panelist to
compute a median, and on page 24 of his report he mentions that the median gain
in verbal achievement (reading) is .13. (His corresponding means were .17 for
the sample of comparisons and .15 for the sample of studies.) I have examined
Stephan's effect sizes from his Table 1 and have been unable to arrive at the
same value. My own estimate based on a sample of comparisons and omitting the
studies he leaves out is .08. Readers should scrutinize Stephan's Table 1 and
estimate for themselves the effect size for reading scores above which 50% of
the effect sizes fall and below which 50% fall.

4. The Confidence Problem. Our reanalysis of the panelists' studies using
multiple measures of central tendency should not be interpreted to mean that, in
our opinion, desegregation has had no effect on most schools. There are two
reasons for a low level of confidence in the results presented in Tables 1
through 5. First, we do not know the underlying distribution of mean effect
sizes (however computed) for the population of school districts that have
already desegregated. It is not clear how representative the panel's core set
of studies is. Second, with so few comparisons and studies, we cannot have
much confidence in the sample distributions presented in Figures 1-4. A dozen
new cases could radically alter each of the estimates of central tendency. With
such a poorly estimated and unstable distribution, it is not clear that the mean
would remain unchanged even if more cases were added from the very same
population that the present sample is supposed to represent.

Statistical significance tests are typically used to make inferences about
the level of confidence one should ascribe to findings. (Because of lay
misunderstandings of the word "significance", we prefer to talk of tests of
statistical reliability rather than statistical significance.) Walberg has
maintained that for measures of math and reading combined, none of the estimates
obtained by Krol, Crain & Mahard and Wortman, King & Bryant reliably differ from
zero. In the current case, our calculations of reliability indicate that: (1)
for Armor, the mean estimates for math alone and for reading and math combined
do not differ from zero, but the estimate for reading does so marginally
(p<.10); (2) for Miller, the estimate for math does not reliably differ from
zero, but the estimates for reading alone and for reading and math combined do
so; (3) for Stephan, the effect for math is not reliable, while for reading and
for math and reading combined, conventional levels of statistical reliability
are reached irrespective of whether the mean is computed with or without
correction for the length of desegregation; (4) for Wortman, the effects for
reading and for reading and math combined both differ from zero even when we
consider only the small sample of studies with pretest adjustments.

These statistical tests are themselves partly problematic. In all cases
except Miller, the analyses are based on a sample of comparisons. But since
some studies produce more than one estimate of effect size, the assumption of
independent errors may not be met. This particular problem does not occur in
Miller's analysis. But there the small sample of studies increases the
dependence on the assumption of a normal distribution of effect sizes. But as
the difference between the various measures of central tendency indicates, the
distribution of effect sizes may not be normal. Hence, all the statistical test
results reported above (and in Walberg) should be treated with some caution. As
they stand, they suggest that neither the mean reading effect nor the mean
effect for reading and math combined is due to chance.

However, to complicate matters, it is not likely that the medians and modes
differ from zero. The standard error of a median is normally set at 125% of the
value of the standard error of the mean from the same distribution, reflecting
the greater instability of medians. By this criterion, no medians reliably
differ from zero for reading or for reading and math combined. No estimate of
the reliability of modes is necessary since they hover so closely around zero.
However, the medians and modes are based on so few cases that estimates could
shift radically once a dozen new values are added to the distribution.
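
The 125% rule can be illustrated with a brief sketch. The factor of 1.25 is
the rounded large-sample value for a normal distribution (the square root of
pi/2, about 1.2533). The effect sizes are hypothetical and the function name is
an assumption; the ratios printed are rough indicators, not the tests reported
above.

```python
import math
import statistics


def reliability_ratios(values, median_factor=1.25):
    """Ratio of the mean and of the median to their standard errors.

    The standard error of the median is taken as 1.25 times the standard
    error of the mean, following the rule quoted in the text.
    """
    se_mean = statistics.stdev(values) / math.sqrt(len(values))
    se_median = median_factor * se_mean
    return {
        "se_mean": se_mean,
        "se_median": se_median,
        "mean_ratio": statistics.mean(values) / se_mean,
        "median_ratio": statistics.median(values) / se_median,
    }


# Hypothetical, right-skewed effect sizes (illustration only).
effects = [-0.17, -0.04, 0.0, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25, 0.40, 0.70]
ratios = reliability_ratios(effects)
print({name: round(value, 2) for name, value in ratios.items()})
```

Because a skewed sample has a median below its mean and the median carries the
larger standard error, the median's ratio is smaller on both counts, which is
why a mean can differ reliably from zero while the median does not.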

If the population of effect sizes is indeed skewed, it is not clear which
measure of central tendency is to be preferred. The mean represents national
impact at some abstract, aggregate level, and is of use to those persons and
groups most interested in gaining a national perspective on education and
society. The mode represents what should happen to the typical school, and so
may be of most interest to any school district or judge considering
desegregation, especially if the district in question differs from those where
desegregation has produced large impacts in the past (characteristics we shall
explore below). For any commentator willing to assume that the distribution of
effect sizes in the population approximates the (unclear) sample distributions
we have obtained, it is important to decide, at a high level of consciousness,
on the different utilities implicit in different measures of central tendency.

5. Why Do Some School Districts Show Larger Gains in Reading? The skewness in
the distributions indicates not only that the mean may be a misleading measure
of central tendency, but also that it might be productive to probe the reasons
why some school districts are outliers. Discovering what they did to achieve
larger gains could, for instance, be used to develop specific guidelines for
desegregation plans, which school districts could then select if they believed
they were suitable for their schools. But since desegregation is an amorphous
set of activities that differs from site to site, and since we have so few
studies, no one should expect a definitive answer to the question of what
characterizes school districts with large reading gains. At most, one should
expect grounded hypotheses to emerge. Our discussion is in two parts: Which
were the districts with large gains; and what differentiates them from other
districts?

(a) Which Were the School Districts with Larger Reading Gains? Before probing
substantive reasons for high reading gains, it is important to raise three
methodological issues that reduce confidence in judgments about the
identification of valid outliers. The first is that the sample sizes in the
studies under review vary considerably, from 12 desegregated children in Zdep
to over 1,000 in Sheehan and Marcus. Several panelists analyzed the
relationship between sample size and effect size, concluding that smaller
samples tended to produce larger estimates but that the relationship was not
reliably different from zero. Considering classical sampling theory in
isolation, we would not expect sample size to be linearly related to effect
sizes without transformation of the original metrics. In a normal distribution
with mean equal to zero, we would expect smaller samples to produce larger
estimates, but in equal proportions each side of zero. This is equivalent to a
negatively accelerated decay function when plotting effect size against sample
size, irrespective of the sign of the effect. Figure 5 presents the mean
reading effect size, free of sign, for studies with desegregated samples of 20
or less, between 21 and 30, between 31 and 40, 41 and 50, between 50 and 100,
and over 100. An overall relationship is apparent that might well be of the
expected quadratic form, though with such a small sample of studies it is hard
to be sure. More important, though, is that with such a small sample of studies
it is possible for more of the studies with smaller samples to fall on one side
of the mean than the other. If we take the studies identified from Miller's
estimates as outliers, we note the following individual sample sizes in the
desegregated groups for analyses of reading: Anderson (34), Beker (36),
Syracuse (24), and Zdep (12). This is a total of 106 desegregated children.
Since a total of 2812 were studied for reading, the outliers responsible for
the higher mean estimates constitute about 4% of the total sample of
desegregated children, but are about 25% of the studies Miller analyzed (4 of
17). If we add Rentsch to the list of outliers because analysts other than
Miller and Stephan place him there, then the outliers represent 30% of the
schools studied (5 of 17) but only 1% of the children!
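
The shares quoted above, and the expected shrinking of sign-free effect sizes
as samples grow, can be checked with a short sketch. The sample sizes and
totals come from the text. The decay formula assumes a true effect of zero, two
groups of equal size, and the usual large-sample variance of a standardized
mean difference (about 2/n), none of which the panel studies necessarily
satisfy; the function name is hypothetical.

```python
import math

# Desegregated sample sizes quoted in the text for the four reading outliers.
outlier_n = {"Anderson": 34, "Beker": 36, "Syracuse": 24, "Zdep": 12}
total_children = 2812  # desegregated children studied for reading (from the text)
total_studies = 17     # studies Miller analyzed (from the text)

share_children = sum(outlier_n.values()) / total_children
share_studies = len(outlier_n) / total_studies
print(f"outliers: {share_children:.1%} of children, {share_studies:.1%} of studies")


def expected_abs_effect(n_per_group):
    """Expected absolute effect size when the true effect is zero.

    Assumption: d is approximately Normal(0, 2/n) for two groups of n each,
    so E|d| = sqrt(2/pi) * sqrt(2/n).
    """
    return math.sqrt(2 / math.pi) * math.sqrt(2 / n_per_group)


for n in (12, 24, 36, 100, 1000):
    print(n, round(expected_abs_effect(n), 3))
```

The expected sign-free effect falls steeply at first and then flattens as the
sample grows, which is the negatively accelerated decay described above; a
handful of very small studies can therefore produce large estimates by chance.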

Insert Figure 6 about here



A second methodological reason for caution in substantively pursuing why
some school districts have large gains is also related to sampling instability.
If we were to define positive outliers in terms of their gains in both reading
and math, few of the outliers would be the same as when reading was considered
alone. Thus, the unweighted gain in Anderson, using Miller's estimates, was
.70, for Beker was .19, and was .26 for Zdep. (It was .035 for Rentsch in
Miller's analysis.) When a joint criterion is used to define outliers, only
Anderson clearly emerged. Indeed, the three other studies had negative
estimates for math! Pursuing the instability theme further leads us to note
that the second largest negative outlier for reading (Van Every, -.17) is based
on a desegregated sample of only 20, and the math estimate is +.54! We are not
arguing that desegregation should have affected both reading and math. We are
only suggesting that we would be more confident of having identified valid
outliers if reading and math gains were correlated among the potential outliers.

The third methodological issue concerns how effect sizes were computed.
All the panelists are commendably sensitive to the need to control for
differential growth rates between the nonequivalent desegregated and segregated
control groups, and all go about the task in similar, but not quite identical,
ways. The adequacy of statistical adjustments for selection-maturation depends
on many factors, including the (unknown) true selection difference, the
reliability of measures, the comparability of within-group regression lines,
etc. In metaanalysis, the hope is that, across all the studies examined, the
inevitable imperfections in the analysis of any one study will even out so that
the average bias due to selection-maturation will be zero. However, there is no
presumption that the bias will be zero in any single study. Yet in analyzing
outlier effect sizes one has to assume that the average selection and
selection-maturation bias among the outliers is zero. However, one might easily
have capitalized on chance and have isolated the subset where adjustment has
been the least adequate. Indeed, in four of the five outlier cases the
desegregated children outperformed the segregated initially, and in the other
case the means were essentially identical.

Thus, the possibility cannot be ruled out that the outliers reflect: (1)
sampling instability due to small sample sizes; (2) sampling instability that
makes high reading gains not synonymous with general achievement gains; and (3)
an underadjustment for initial group differences in reading achievement. It is
within the limitations afforded by these three points that I now examine
substantive characteristics of the outliers for reading.

(b) The Characteristics of Outlier School Districts. As previously discussed,
one characteristic of the outlier school districts on Miller's list is that they
evaluated longer periods of desegregation, up to three years in some cases. The
relationship between effect sizes and length of desegregation is not clear due
to sampling instability, with all the panelists who tackled the issue concluding
that effect sizes seem larger in the five studies with two years of
desegregation than in the nine studies with one year of desegregation. However,
estimates seem to be lowest of all in the three studies with three years of
desegregation! Since two year studies predominate among the studies with larger
effects in Miller's Table 2, this suggests that effect sizes may be related to
the amount of desegregation that has taken place.

The predominance of two year studies among the districts with larger
effects also leads me to prefer Stephan's estimates for defining outlier school
districts. But to use his data I averaged his estimates across grades to give a
single reading mean per study. The outliers fall into two groups: Anderson
(.49), Syracuse (.58) and Zdep (.66) are in the one and Klein (.23) and Rentsch
(.22) in the other. Even listing these outliers raises once again the spectre
of instability, since Klein would not be an outlier for Miller, while Beker
would be for Miller but not for Stephan!

Two substantive factors are associated with Stephan's larger effect sizes.
One concerns when desegregation takes place. Figure 6 shows effect sizes per
eight months of desegregation plotted against when desegregation began. The
latter values are taken from Wortman rather than Stephan, since the information
about grades in Stephan's Table 1 appears to be based on the grade at which
desegregation began in some cases and on the grade when it ended in others.
Figure 6 shows a clear negatively accelerated decay curve, with larger effects
the earlier the desegregation. None of the panelists obtained effects of grade
on achievement that were as clearcut as this, probably because they computed
linear relationships, truncated at inappropriate grade levels, did not adjust
effect sizes for the length of desegregation, or assessed the grade of
children when the study ended. Figure 6 suggests that at second grade a gain is
obtained of about .30 standard deviation units per eight month year (though this
estimate is based on only four studies!), that at the third grade the gain is
.12 (five studies), while it is .14 at the fourth grade (based on nine studies).

In trying to explain why a small set of school districts produced large
reading gains that skewed the distribution of effect sizes, it is important to
probe whether the desegregation was voluntary or mandatory. According to
Crain's report in this volume, all of the school districts I have identified as
positive outliers had voluntary programs. This is perhaps not surprising, since
the programs were voluntary in 15 of our 19 studies. For reading, only three
school districts showed overall negative effects in Stephan's analysis: Sheehan
& Marcus (-.07), Smith (-.01) and Van Every (-.12). The first and last of these
were mandatory programs. Of the two other mandatory programs in the panel's
sample, the study by Carrigan was omitted from some analyses but, when
aggregated across grades, it produced a small negative effect. The other
mandatory study produced a trivial gain of .02 across grades (Evans). It is
clear, then, that mandatory programs were not associated with reading gains but
that voluntary programs were.

However, the relationship between effect size and the voluntary/mandatory
nature of desegregation could only be considered causal for these four cases of
mandatory desegregation if all other interpretations of the relationship could
be ruled out. Yet two of the studies, Evans and Sheehan & Marcus, were
done in Texas, were the only ones to use the Iowa Test of Basic Skills, and were
two of the only three studies of desegregation activities that began in the
1970's. (The other study with apparent negative outcomes, Van Every, took place
in Flint, Michigan, began in 1969, used the SRA test, and had very small
samples.)

Just as it would be wrong to conclude with confidence that mandatory
programs produce no gains in reading, so it would be wrong to conclude from the
panel's core studies that desegregation beginning in the earlier grades results
in larger positive gains. There are signs of each relationship, but with only
four mandatory programs and four second grade samples it is inevitable that we
have not made heterogeneous all the sources of irrelevancy that might have
produced spurious results. The reality is that if the sample size of studies is
too small to permit a meaningful analysis of central tendency across 19 studies,
it is even less appropriate for conducting responsible internal analyses to try
to explain why some school districts seem to have achieved larger effect sizes
than others.

This is true, not only of the potential explanatory factors analyzed above,
but also of other factors about which individual panelists have speculated.
Stephan points out that studies conducted at an earlier date tend to show larger
effects, while Miller suggests that school districts with larger effects may
have introduced enrichment programs at the time desegregation occurred and may
have had smaller percentages of blacks in the desegregated classrooms. With the
small samples on hand, it is inevitable, first, that no strong probes of the
impact of moderator variables are possible; and second, that many
interpretations remain to explain why some districts achieved particularly large
positive or negative gains.

The points we want to stress are that: (1) the form of the distribution of
effect sizes is not clear either for the population of school districts that
have desegregated or even for the small sample of districts we have analyzed;
(2) there may be districts that benefitted more from desegregation than other
districts, but if so, it is not clear whether they are outliers for irrelevant
methodological reasons (small sample sizes, unstable measures, or initial group
achievement differences not completely adjusted away) or for relevant
substantive reasons; and (3) of the relevant substantive reasons, several are
contenders as explanatory constructs but their unique contribution cannot be
unconfounded from the contribution of other factors. The factors at issue
include: the child's grade at desegregation, the number of years of
desegregation, whether the desegregation is voluntary or mandatory, the
percentage of whites in the class, the copresence of desegregation and new
enrichment programs, and the year in which desegregation took place.

6. Summary of the Reanalyses. A casual reading of the panelists' papers leads
to the four conclusions mentioned earlier that are based upon the panel's 19
studies and seem quite consonant with the findings of prior metaanalyses by Krol
and by Crain & Mahard that involved larger samples. These conclusions are: (1)
desegregation does not decrease the achievement of black children; (2) it
probably does not increase math achievement; (3) it probably raises reading
scores; and (4) the increase in reading scores is somewhere between .06 and .16
standard deviation units, or between about two and six weeks. These last
estimates were computed from 17 studies, about half of which dealt with a single
year of schooling, and then usually the first one after formal desegregation
began.

Our own analyses corroborate the first two of these findings. We continue
to find no evidence that desegregation decreases achievement or that it
increases achievement in math. Our differences involve the conclusions about
reading. The present analysis suggests that whether there is an effect or not
depends on the measure of central tendency used, with statistically reliable
results emerging for mean gains but not for median or modal gains. The
implication of the lower medians and modes is that the mean differences are
found, not so much because the "average" effect of desegregation on reading is
positive but because (in the panel's sample at least) some school districts made
atypically large reading gains that skewed the distribution of effect sizes.
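
How a few large gains can pull the mean above the median and mode is easy to
see in a small numerical sketch. The effect sizes below are invented purely for
illustration (they are not the panel's data); they are chosen so that most
values sit near zero and two are atypically large.

```python
import statistics

# Hypothetical effect sizes (NOT the panel's estimates): most districts near
# zero, with two atypically large gains in the right-hand tail.
effects = [0.00, 0.00, 0.00, 0.05, 0.05, 0.10, 0.10, 0.60, 0.70]

mean_gain = statistics.mean(effects)
median_gain = statistics.median(effects)
modal_gain = statistics.mode(effects)

print(f"mean   = {mean_gain:.2f}")
print(f"median = {median_gain:.2f}")
print(f"mode   = {modal_gain:.2f}")
```

With a right-skewed distribution of this kind the mean exceeds both the median
and the mode, so a conclusion about the "average" district depends on which
measure of central tendency one takes to be appropriate.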

It is therefore difficult to make an estimate of the size of the reading
effect. There is one range estimate for the mean (between .13 and .16 when the
same 17 studies from the panel's 19 are used with each analyst's own effect size
computations; see Table 2), another range estimate for the median (.00 to .08
irrespective of the samples used; see Table 1 or 2) and yet another for the
modal effect (between -.05 and +.05; see Tables 1 and 2). Combining the reading
and math effect sizes makes no difference to the conclusion that central
tendency values differ. The estimated means vary between .07 and .16 for the 17
common studies; the study medians vary between .00 and .06; and the mode falls
between ±.05!

Why do some schools achieve unexpectedly large reading gains? With so few
studies this question cannot be answered in any definitive way. There are at
most indirect suggestions that such schools may have desegregated in the 1960's,
had voluntary plans, included the earlier grades in their evaluation design,
been studied for longer time periods, have had a higher percentage of white
children in desegregated classrooms, and may have introduced enrichment programs
at the same time as desegregation. Such variables could have had independent or
joint impacts, and it is inevitable that other variables could be thought of
that should be added to any list of possible explanations of why some districts
gained so much more than others in reading. Among the possibilities is chance,
for it is noteworthy that the outlier studies had smaller sample sizes and that,
with the exception of Anderson, the districts with the largest gains in reading
were not the districts with the largest gains in math. While it is not
necessary for desegregation to impact on both (and Stephan gives an ex post
facto rationale for why desegregation should affect reading but not math), we
would be more confident of having identified valid outliers had there been more
of a consistency in gains between reading and math.

If the present analysis had not taken place, there would have been what I
interpret to be an impressive consistency of results for reading and math
combined. When they defined better studies their own way and combined all
measures and grades, both Krol and Crain & Mahard reached comparable mean
estimates of .10. (For Crain & Mahard the value is derived from the combined
results for their randomized experiments and their two longitudinal designs with
black segregated controls.) Using their own preferred set of studies and
considering math and reading only, the present panelists arrived at estimates
varying around this. Armor obtained .04, Miller .12, Stephan .14, and
Wortman .17 when his two strongest designs were weighted and averaged based on
part of his sample of 31 studies. These estimates are generally higher than the
values of Krol and Crain & Mahard, but not by much. Indeed, I suspect that few
commentators would find much of a difference between a gain of one month and of
one and one-half months (.10 versus .15).
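
The translation from standard deviation units into months used in the last
sentence can be made explicit. The sketch below assumes a simple linear
conversion of .10 standard deviation units per month, which is read off the
text's own equivalence (.10 for one month, .15 for one and one-half months); the
constant and function names are illustrative, and the conversion is only a rough
convenience, not a property of any particular achievement test.

```python
# Assumed conversion rate, inferred from the text's ".10 versus .15" example.
SD_UNITS_PER_MONTH = 0.10

def effect_size_to_months(effect_size, sd_units_per_month=SD_UNITS_PER_MONTH):
    """Convert an effect size in SD units to approximate months of gain."""
    return effect_size / sd_units_per_month

# Mean estimates quoted in the text for reading and math combined.
estimates = {
    "Krol; Crain & Mahard": 0.10,
    "Armor": 0.04,
    "Miller": 0.12,
    "Stephan": 0.14,
    "Wortman": 0.17,
}

for analyst, effect in estimates.items():
    print(f"{analyst}: {effect:.2f} SD is about {effect_size_to_months(effect):.1f} months")
```

Under this assumption the panelists' estimates span roughly half a month to a
little under two months, which is the sense in which they vary around the
earlier value of .10.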

The present analyses have muddied these waters by suggesting that the means
above are noticeably higher than their corresponding medians or modes and by
further suggesting that the choice of a measure of central tendency depends in
part on knowledge of the distribution of effect sizes in the population. But
with such a small sample, the true distribution cannot be confidently
ascertained. For those who accept my analyses, I have substituted a low degree
of certainty about the effects of desegregation for the higher degree that used
to pertain but that depended on distributional assumptions that may be wrong.
Social science analyses often increase uncertainty, and this is to be preferred
to a premature certainty about something wrong or misleading. However, it is
even more preferable to reduce quickly new sources of identified uncertainty.
In the present case, this means examining the distributions obtained by Crain &
Mahard (1983) for their better studies to see if they are skewed.

7. A Comparison of the Present Results with Crain & Mahard. Crain & Mahard
(1983) insist that the effects of desegregation are best assessed from
randomized experiments and from studies where desegregated schooling begins at
kindergarten or grade one so that the child has never known segregated
schooling. When the randomized experiments and the studies with kindergarten
and first grade samples were studied separately, Crain & Mahard obtained
estimates of .30 in each case. They therefore interpreted this as the best
estimate of the effects of desegregation on the achievement of black children.
Such an effect is moderately large by many of the (arbitrary) standards used for
assessing the effects of educational interventions, as Walberg's essay in this
volume attests. It is certainly a more optimistic value than obtained in the
metaanalyses reviewed here. Hence, we will consider the estimates of Crain &
Mahard in some detail.

It is clear that their estimates decrease to some extent when we consider
medians and modes rather than means. Crain kindly supplied me with the
distribution of effect sizes for the seven comparisons involving randomized
experiments, with Zdep omitted. The mean was .27, the median .24, and the mode
could not be computed. For the kindergarten and first grade samples evaluated
using before-after designs and black segregated control groups, the mean based
on 17 comparisons was .31, and the median and mode were each .26. I do not know
what the mean, median, and mode were for all the studies and all the grades with
before-after measures and black controls. Nonetheless, the data above suggest
that the medians and modes do not reduce to zero in the studies that Crain and
Mahard prefer for estimating the effects of desegregation.

Unfortunately, the results of Crain & Mahard are not easy to interpret as
estimates of generalized causal impact. First, nearly all the randomized
experiments were part of Project Concern and so offer little comfort as to the
generalizability of effects. Also, with so few degrees of freedom in the
analysis of randomized experiments, it is not likely that the mean effect
reliably differs from zero. Second, only one of the kindergarten and first
grade samples of Crain & Mahard was included in the present panel's
sample (Carrigan) despite the specification of both Crain & Mahard and the
present panel that before-after designs and black controls characterized better
studies. This discrepancy in the number of comparisons presumably occurs
because of differences in strategies used to estimate standard deviations
and, principally, because Crain & Mahard were willing to accept pretest measures
that the present panel would not accept because it required that pretest and
posttest measures tap into the same conceptual domain. For understandable
reasons the pretest measures of very young children tend to reflect "academic
readiness" rather than the academic achievement that is assessed at the
posttest. If the usual selection bias operated and the children attending
desegregated schools were more able or more motivated than their segregated
counterparts, then the reduced pretest-posttest correlation caused by
differences between the readiness and achievement measures would probably result
in overestimating the effects of desegregation in each study (Campbell & Boruch,
1975). Consequently, it is unlikely that valid estimates of the effects of
desegregation were obtained with the kindergarten and first grade samples of
Crain & Mahard, though the authors have indeed identified a significant issue.
After the first generation of desegregation in a district, no students enter
desegregated schools from segregated ones; nearly all begin and end their
schooling in desegregated classes. Consequently, it is of special importance to
learn how desegregation is related to the achievement of very young children.
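
The argument that a weak pretest leads to overestimated effects can be
illustrated with a small simulation. The sketch below is a hypothetical setup,
not a reanalysis of any study: two groups differ in underlying ability, the
treatment has no effect at all, and the pretest measures ability with a chosen
amount of error. The posttest difference is then adjusted using the pooled
within-group regression of posttest on pretest, a simplified stand-in for the
covariance adjustments the panelists used. All names and parameter values are
assumptions made for the illustration.

```python
import random

def adjusted_difference(pretest_noise_sd, n=20000, true_gap=0.5, seed=0):
    """Covariance-adjusted posttest gap when the true treatment effect is zero."""
    rng = random.Random(seed)
    groups = {}
    for name, mean_ability in (("desegregated", true_gap), ("segregated", 0.0)):
        pre, post = [], []
        for _ in range(n):
            ability = rng.gauss(mean_ability, 1.0)
            # Pretest: ability plus measurement error (a noisy "readiness" score).
            pre.append(ability + rng.gauss(0.0, pretest_noise_sd))
            # Posttest: ability plus error, with NO treatment effect added.
            post.append(ability + rng.gauss(0.0, 1.0))
        groups[name] = (pre, post)

    def mean(values):
        return sum(values) / len(values)

    # Pooled within-group slope of posttest on pretest.
    sxy = sxx = 0.0
    means = {}
    for name, (pre, post) in groups.items():
        m_pre, m_post = mean(pre), mean(post)
        means[name] = (m_pre, m_post)
        sxy += sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post))
        sxx += sum((x - m_pre) ** 2 for x in pre)
    slope = sxy / sxx

    (pre_d, post_d), (pre_s, post_s) = means["desegregated"], means["segregated"]
    return (post_d - post_s) - slope * (pre_d - pre_s)

print("noisy pretest:   ", round(adjusted_difference(pretest_noise_sd=1.0), 3))
print("reliable pretest:", round(adjusted_difference(pretest_noise_sd=0.1), 3))
```

In this setup the adjusted difference stays clearly positive when the pretest is
noisy, even though no effect was built in, and shrinks toward zero as the
pretest becomes a better measure of what the posttest assesses. That is the
direction of bias described above for readiness pretests.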

The estimate of Crain & Mahard that most closely approximates the work of
the present panel is based on all grade levels, all outcome measures,
before-after designs, and black control groups. As mentioned earlier, the
estimate they obtained was .10, and this is much closer to the panel's estimate
than the probably inflated value of .30 provided by studies of kindergarten and
first grade children where initial differences were not well controlled for.
However, nothing in the present panel's work specifically refutes an implicit
claim in Crain & Mahard that desegregation may have larger impacts at younger
grades. To say that .30 may be inflated is not to say the true value for the
youngest children is .10! The issue of grade differences in effect sizes has
not been solved by either the present panel or Crain & Mahard, and must remain
an issue for further research.

I want now to interpret the meaning of both the absence of gains in
mathematics and the presence of reading gains of between two and six weeks. To
do this, I broach two issues. First, I ask what implications the findings have
for various stakeholder groups, and in so doing I also explore how generalizable
the findings are beyond the 19 studies examined. Second, I ask what
implications this metaanalysis project has for theories of research synthesis.

1. Stakeholder Analysis 

(a) Protagonists of School Desegregation. The analyses I have presented
might give some comfort to protagonists of school desegregation, particularly
those who support it for reasons of equal access, the improvement of race
relations, or the enhancement of self-esteem rather than for reasons of academic
achievement. For such protagonists the crucial finding from all the analyses of
all the scholars is that school desegregation does not decrease the achievement
of black children. If it did, this would represent an undesirable side effect
of desegregation with which protagonists would probably have to deal ethically,
ideologically, and politically. My guess is that it is more difficult to argue
that a decrease in achievement is of no consequence than it is to argue that the
absence of an increase is of no consequence. Unintentionally decreasing
achievement would be a worrisome side effect of desegregation that no
protagonist could ignore.

Protagonists of school desegregation can also take some succor from an as
yet imperfectly corroborated trend in the data. This is that achievement gains
may be larger in younger children who have not had to go through as long a prior
experience in segregated classes. Indeed, one of the major points in Crain &
Mahard (that we could not independently test) is that achievement gains are
greatest of all if black children have never been segregated. This is a very
important point, for many of the advocates of desegregation view it as a means
of providing desegregated (or preferably, fully integrated) education to all
children for all of their school career. From this perspective, the group of
children who start out in segregated schools are not the group of greatest
interest. Of more concern are those who have never been segregated and will
never experience the historically circumscribed difficulties associated with
being among the very first children to transfer into a desegregated school
district. Such pioneers move into environments that are novel, not only for
them, but also for teachers, administrators, parents and local leaders. Because
of the novelty, more mistakes are likely to occur than is the case at a later
date when new cohorts of children come through the system, and teachers,
administrators and parents should have benefitted from their mistakes. Later
cohorts might be expected to benefit more from desegregation, both because they
have never known segregated schooling and because the school personnel are more
experienced.

Protagonists of desegregation might also note that over half of the studies
examined by the present panel involved only one year of desegregation.
Moreover, the typical fall-spring testing sessions involve less than a complete
school year. Thus, most of the studies involved only a small fraction of the
total time that children experience desegregation, especially if they enter
desegregated schools in the early grades. Protagonists of school desegregation
might wonder if its full impact has yet been evaluated and they may point to the
larger effects in two year studies to suggest that the cumulative impact of
desegregation may be much larger than its first year effect. The major problem
with this argument is that the studies testing three years of desegregation
produced no effects. Consequently, protagonists of desegregation would have to
discredit the three-year studies in order to make the case that desegregation
has not yet been tested at its presumptively most efficacious. However, it is
not difficult to discredit these studies since they are only three in number and
they undoubtedly differ from the majority of studies in many ways that are
correlated with lower achievement gains.

(b) The Perspectives of Antagonists of School Desegregation. The present
analyses should bring most succor to antagonists of school desegregation. Where
before they would have had to acknowledge the gains in reading caused by
desegregation and would have had to argue that their practical implications are
trivial (as Armor has done in his present essay), antagonists can now point to
analyses which suggest that there have been no real gains in reading because of
desegregation in most school districts. This involves a shift in the
argument, from how meaningful the obtained reading gains are considered to be,
to whether there are any gains at all whose value is worth debating. But
although the medians and modes in Tables 1 through 5 could be used by
antagonists of school desegregation, I have tried to stress how unstable these
estimates are and how much they might be changed by adding just a dozen more
cases to the distribution of effect sizes.

Antagonists of school desegregation can also point to the opaque trend in
the data for mandatory programs to result in zero effect sizes and for larger
effects to be found with voluntary programs. Few antagonists of desegregation
oppose plans in which local authorities agree to desegregate and receiving
schools voluntarily accept pupils who volunteer to go to the receiving schools
(or whose parents "volunteer" for them). The objection is to mandatory
desegregation which, in both my analysis and Stephan's, produced no reading or
math gains. (This comparability was achieved despite the fact that Stephan
classified only two of the panel's studies as mandatory, whereas using the
essays in this volume by Crain and Armor, I classified four as mandatory,
although one was by Carrigan.) However, little confidence can be placed in the
idea that mandatory desegregation plans cause no reading gains. Given the small
number of studies overall, and of mandatory studies in particular, the
mandatory/voluntary distinction was correlated with the year desegregation took
place, the test used to measure achievement, the region of the country (two
studies were in the Dallas/Ft. Worth area), and was probably also correlated
with many other factors that would emerge as soon as one examined in detail the
specifics of the mandatory desegregation studies by Sheehan & Marcus, Evans, and
Van Every.

Antagonists of school desegregation can also point to the paucity of
clearcut evidence about desegregation plans that will raise school achievement.
Protagonists of school desegregation, and persons whose job it is to plan the
desegregation effort in a particular community, want to know what types of
desegregation will be effective. They prefer this specific question to the more
global: "How effective is desegregation in general in raising achievement?"
All the parties concerned with desegregation research realize that there is no
standard desegregation treatment, but many of the protagonists of desegregation
hope to discover a set of activities that, when implemented in newly
desegregated schools, will raise achievement, among other things. The present
analysis has pointed with little confidence to some possible elements of
effective desegregation plans. But nothing in the list of elements is new, and
after the panel's reviews nothing is better "proven" as a causally efficacious
element of desegregation plans than was the case before. Antagonists can point,
therefore, to the saliency the present review gives the continuing
uncertainty about the elements of desegregation that enhance achievement. This
is not to say that the present metaanalysis probed all (or even most) of the
prospective causal elements, or even that it probed the better corroborated
among them. All we maintain is that it probed some of them, but failed to make
us any more confident that we know how to put together desegregation plans that
will raise achievement in reading and math.

(c) Persons Planning Desegregation Activities. Irrespective of their
personal beliefs about the desirability of desegregation, mandated or otherwise,
there are some groups of persons who have to plan desegregation activities. One
such group consists of judges, civil servants, consultants, and school district
officials who develop desegregation plans for school districts or metropolitan
areas. Such persons want to know about the types of desegregation plan, or the
major elements within an overall plan, that will produce the kinds of outcomes
they most value from desegregation. The present panel's work provides nothing
of substance to help such planners. It might, however, make a minor
contribution to undermining their morale, for the difference in outcomes between
the means, medians and modes suggests that the effects of their labors on
achievement are likely to be minimal, at least in the short term and to the
extent the backward-looking analyses on which this review is based are pertinent
to the immediate future.

This last point is crucial. For many theorists of evaluation its function
is less to summarize what has happened in the past and more to discover what
might be effective in the future. In this context, it is worth noting that the
major difficulties with metaanalysis concern the possibility that the bias in
one direction may be greater than in the other across all the studies under
review. The panelists dealt exhaustively with biases that might lead to false
conclusions about whether the relationship between desegregation and learning
gains is causal, but few of them considered biases that limit the
generalizability of findings and hence their presumed utility for planners. In
fact, 16 of the 19 studies were begun in the 1960's, and only one is later than
1975. The dearth of later studies is striking, and Armor's essay contains an
important paragraph expressing indignation that so few evaluations of school
desegregation were undertaken in the 1970's, a decade characterized by so many
large-scale evaluations in other areas within education. Most of the 19 studies
under examination were dissertations or local efforts by the staff of a school
district. This may explain why the sample sizes are so small, the documentation
of desegregation activities so meagre, and the measurement plan so sparse.

Another constant bias is obvious. The panel was constrained to examine how
desegregation impacted on the achievement of black children. Yet for most
planners achievement does not exist in a vacuum. The utility of the achievement
gains caused by desegregation can vary in meaning depending on whether the
desegregation activities in question also reduce or widen achievement gaps
between blacks and whites, are or are not accompanied by an increase or
reduction in interracial prejudice, are or are not accompanied by white flight,
are or are not associated with self-esteem gains, are or are not associated with
community support, are or are not related to changes in real estate values, are
or are not associated with the founding of magnet or lab schools, etc. By
examining just school desegregation and black achievement much of the
interpretative context vital to planners is lost.

A second group of planners is composed of teachers, both those
contemplating desegregation and those already teaching in desegregated
classrooms. In theory, research could be of help to them in identifying
practices they can implement that will improve the functioning and results in
classrooms. However, the present metaanalytic efforts do not speak to such
learning needs. The teacher's needs are more micro than macro, more concerned
with process than outcome, and with explanation than descriptive causation. The
question on which the panel worked is a question that meets the interests of
central government officials with responsibility for oversight more than it
meets the interests of those who must plan for desegregation in specific school
contexts.

(d) Persons Honestly Seeking to Learn What Desegregation Has Accomplished.
The panel's papers help those who would honestly understand what desegregation
has accomplished by questioning the utility of so global a label as
"desegregation". Miller's analysis shows that, after the mean effect size is
accounted for, more variance remains than is due to chance. This suggests that
systematic forces have to be taken into account over and above whether
desegregation took place if there is to be any reasonable prediction of effect
sizes. Elementary consideration of the decentralized structure of educational
decision-making suggests that desegregation plans will differ from location to
location and that, even where they appear similar on paper, there will be local
adaptations to suit local conditions. From the perspective of someone seeking
to learn what desegregation has achieved, elementary questions need to be asked:
"What does desegregation mean?"; "What are the criteria that should be used to
create clusters of desegregation activities?"; "What types of desegregation
result from such clustering procedures?"; and "How well do the different
clusters or types of desegregation predict differences in achievement outcomes
across districts?". At present, persons interested in learning about school
desegregation are more likely to have learned to identify the more pertinent
questions than they are to have learned answers to these questions.
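
The claim that more variance remains than chance would produce can be made
concrete with a homogeneity statistic. What follows is only a minimal sketch of
one conventional approach (an inverse-variance weighted mean and the Q
statistic); it is not necessarily the procedure Miller followed. The function
name, the effect sizes, and the sampling variances are all hypothetical and are
not values from the panel's studies.

```python
def homogeneity_q(effects, variances):
    """Return (weighted mean, Q, degrees of freedom) for a set of effect sizes.

    Q sums the squared deviations of each effect size from the
    inverse-variance weighted mean, each weighted by 1/variance.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need two or more effect sizes with matching variances")
    weights = [1.0 / v for v in variances]
    weighted_mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - weighted_mean) ** 2 for w, d in zip(weights, effects))
    return weighted_mean, q, len(effects) - 1


# Hypothetical data for illustration only (assumed, not from the NIE panel).
effects = [0.0, 0.1, 0.2, 0.7]
variances = [0.01, 0.01, 0.01, 0.01]

mean_es, q, df = homogeneity_q(effects, variances)
print(f"weighted mean = {mean_es:.2f}, Q = {q:.1f}, df = {df}")
```

If the studies shared a single underlying effect, Q would be expected to lie
near its degrees of freedom. A Q well above that figure, as in this invented
example, is the pattern described above: variation between studies that the
mean effect size alone does not account for.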

But there are some persons interested in the effects of desegregation, very
globally conceived, most of whom are government officials with oversight
responsibility, journalists, or scholars. The present essay may help sensitize
them to the possibility of considerable differences in effects from district to
district and to the possibility that, across all districts, effects may be
highly variable and even skewed. The possibility of skewness might present them
with a problem. Although the mean represents the global impact of desegregation
painted on a broad national canvas, it is of no comfort to judges and school
districts contemplating desegregation or to teachers worrying about how to
handle a racially mixed class. For some of these people, the mode is more
immediately meaningful than the mean. It may be less meaningful in the future,
of course, if (1) there really are outliers, (2) the causes of large gains can
be explained, and (3) school districts can adopt the causal elements present in
the schools with large effects. But we do not yet know what these elements are.
In the absence of such knowledge, the differences between the means, medians,
and modes highlight anew the conflicting information needs of the many groups in
the national educational system who have a stake in desegregation. The
differences are most apparent (1) with respect to what should be
evaluated: desegregation in general, a specific type of desegregation plan, the
particular plan in a particular district, or elements within plans?; and (2)
with respect to what should be assessed: achievement, school discipline, race
relations, self-esteem, enrollment figures, local tax support for education,
local political support for desegregation, home values, etc.? But the
differences in information needs are also apparent with respect to (3) which
measure of central tendency is most appropriate. Different measures speak more
to the interests of some stakeholders than others.
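
How far the three measures can diverge under skewness is easy to illustrate. The
sketch below is a minimal example on invented effect sizes (none are taken from
the panel's studies). It assumes effect-size classes 0.10 wide, which is a guess
suggested by modal midpoints such as -.05 and +.05 in the tables; the function
name and its parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def modal_midpoints(effects, width_hundredths=10):
    """Midpoints of the most heavily populated effect-size class(es).

    Classes are [k * width, (k + 1) * width). Values are converted to integer
    hundredths first so that class boundaries are not blurred by float error.
    """
    classes = Counter(round(d * 100) // width_hundredths for d in effects)
    top = max(classes.values())
    return sorted(
        (k * width_hundredths + width_hundredths / 2) / 100
        for k, count in classes.items()
        if count == top
    )


# Hypothetical, right-skewed effect sizes (assumed for illustration only).
effects = [-0.09, -0.06, -0.03, -0.01, 0.02, 0.04, 0.08, 0.15, 0.40, 0.70]

sample_mean = mean(effects)        # pulled upward by the two large values
sample_median = median(effects)    # middle of the ordered values
sample_modes = modal_midpoints(effects)
print(sample_mean, sample_median, sample_modes)
```

In this invented sample two large values pull the mean above the median, while
the most populated class sits just below zero. A reader interested in the
"typical" district and a reader interested in the national average would
therefore draw different conclusions from the same data.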

2. Theories of Research Synthesis. The present panel represents a unique attempt
to probe to what extent experts with three different presumed commitments would
converge on a common answer about how desegregation has affected the achievement
of black children. Crain and Wortman had already concluded in review articles
or papers that desegregation increased achievement; the opposite conclusion has
been drawn by Armor and Miller; while Stephan and Walberg had published on the
issue but had taken more neutral stances, although Walberg has given court
testimony largely opposed to desegregation. The hope was to achieve a common
estimate of effect size despite the different commitments, based on a theory
that the results would be more credible, and perhaps even more valid, if they
could be replicated across the heterogeneity associated with the analysts' prior
professional commitments.

In general, the effect sizes for math and reading combined did reflect the
prior commitment. Highest were those of Wortman (.17) and Crain, who stressed
the results from his kindergarten and first grade samples and from the
randomized experiments he studied (.30 for all outcome measures combined). The
next highest estimate was from Stephan (.14 without corrections for length of
desegregation), and lowest of all was Armor (.04). The person least fitting
expectations was Miller, whose .12 value was intermediate.

Actually, the theoretical rationale for pluralism of analysts was only
partially realized, given the decision made before the panel met to restrict the
metaanalyses to "good" studies and to use Wortman's prior work to generate that
list. One of the major points in metaanalysis where ideology and other
commitments enter in is when relevant studies are selected for analysis. Panel
members were free to suggest studies for the core list, and Armor succeeded in
having two studies added that had negative effect sizes (Sheehan & Marcus, and
Walberg). He also made a strong and persistent case for excluding Rentsch and
including Carrigan. But few considered calls were heard to add other studies,
even though Crain had a list of 93 that he and Mahard considered relevant, more
than half of which may have been randomized experiments or longitudinal designs
with segregated black control groups. In retrospect, the decision to restrict
the selection criteria to a common set rather than let the panelists select
their own, and the failure to assess each of Crain's 93 studies according to the
panel's criteria of adequate methodology, may have unnecessarily restricted both
the sample of studies and the heterogeneity in assumptions on which the theory
behind the use of multiple panelists depends.

It is not difficult to see why the decision was made to restrict the
metaanalyses to "better" studies. After all, Krol has found smaller estimates
with his "better" studies, as also had Wortman, King and Bryant. But Crain
obtained larger estimates with his "better" studies! Obviously, chance
differences in the studies available, or differences of opinion about what makes
better studies, may have contributed to the apparent puzzle about whether
superior methods were associated with larger or smaller effect sizes. Another
point is also worth keeping in mind. Although one of the rationales for
pluralistic panel members was the credibility and validity afforded by
convergence, a second rationale is that divergence in their results might serve
to force out the differences in assumptions between advocates and opponents of
desegregation, thereby sharpening the focus for future research. Yet the
likelihood of such differences being forced out is presumably greater the more
freedom panelists have to select studies for review.

Another decision that was made before the panel convened was to use
metaanalysis. This technique depends most heavily on the assumption that the
average bias is zero with respect to threats to internal, external, construct,
statistical conclusion, or any other type of validity (Cook & Leviton, 1980).
This assumption is usually dealt with in either or both of two ways. First, a
subsample of studies is isolated for which the assumption is made that the bias
is zero, and the estimate from this sample is then compared to the estimate for
the remaining subsample where bias might be a problem. If there are no
differences in the estimates, the conclusion is drawn that the biasing force in
question has not operated. The second strategy is to assume the source of bias
away by postulating that the total sample studied is heterogeneous with respect
to the threat in question. This last assumption is more credible the more the
sample differs on irrelevancies correlated with the major outcomes.
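
The first strategy amounts to a comparison of two subsample estimates. The
sketch below shows one simple way such a comparison might be set up; it is not
a procedure any panelist is known to have used. The "trusted" flag, the
function name, and every effect size are hypothetical, and the unweighted means
and the standard error of their difference are deliberately the simplest
choices available.

```python
from math import sqrt
from statistics import mean, variance


def compare_subsamples(studies):
    """Difference in unweighted mean effect size, trusted minus remaining.

    Returns (difference, standard error of the difference), using the
    sample variance of each subsample.
    """
    trusted = [s["es"] for s in studies if s["trusted"]]
    rest = [s["es"] for s in studies if not s["trusted"]]
    if len(trusted) < 2 or len(rest) < 2:
        raise ValueError("each subsample needs at least two studies")
    diff = mean(trusted) - mean(rest)
    se = sqrt(variance(trusted) / len(trusted) + variance(rest) / len(rest))
    return diff, se


# Hypothetical studies (assumed values, not the panel's data). "trusted" marks
# the subsample for which bias is assumed to be zero.
studies = [
    {"es": 0.10, "trusted": True},
    {"es": 0.20, "trusted": True},
    {"es": 0.30, "trusted": True},
    {"es": 0.00, "trusted": False},
    {"es": 0.10, "trusted": False},
    {"es": 0.20, "trusted": False},
    {"es": 0.30, "trusted": False},
]

diff, se = compare_subsamples(studies)
print(f"difference = {diff:.3f}, standard error = {se:.3f}")
```

With subsamples this small the standard error is larger than the difference
itself. Failing to find a difference is therefore weak evidence that the
biasing force has not operated.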

Desegregation research is problematic for the metaanalyst since Wortman has
shown that studies without control groups might be biased and few analysts are
willing to use norms or white children as "control groups". The need for
control groups entails that few studies will meet minimal methodological
characteristics. The sample of studies will also tend to be highly variable,
given the wide range of desegregation activities in the decentralized education
sector and the wide range of children, grades and times studied. Consequently,
small samples of possibly abnormally variable estimates will be metaanalyzed.
It is difficult to imagine arriving at confident estimates of distribution and
central tendencies in this situation; and it is also foolhardy to expect to
break the data down in multiple ways so as to examine the correspondence in
estimates across different types of desegregation activities, different years
when desegregation began, different regions of the country, etc. Consequently,
to rule out threats, one has to rely on there being "enough" variability in
region, year of study, type of activities implemented, etc. But given the small
samples, it is not easy to be confident of "enough" heterogeneity in conceptual
irrelevancies. Hence, the low level of confidence I have placed in most of my
own conclusions and those of the panelists.
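
The fragility of a mean computed from a small and variable sample can be shown
by dropping one study at a time and recomputing it. This is a minimal sketch on
invented numbers; the function name and the effect sizes are hypothetical, and
a leave-one-out check is offered only as one simple sensitivity device, not as
a method used by the panel.

```python
def leave_one_out_means(effects):
    """Mean of the sample with each observation dropped in turn."""
    if len(effects) < 2:
        raise ValueError("need at least two effect sizes")
    total = sum(effects)
    n = len(effects)
    return [(total - d) / (n - 1) for d in effects]


# Hypothetical sample of five effect sizes with one large value (assumed).
effects = [0.0, 0.05, 0.05, 0.1, 0.8]

loo = leave_one_out_means(effects)
full_mean = sum(effects) / len(effects)
print(f"full mean = {full_mean:.2f}, "
      f"leave-one-out range = {min(loo):.2f} to {max(loo):.2f}")
```

In this invented case the full-sample mean is 0.20, but removing the single
largest study reduces it to 0.05. An average that depends this heavily on one
observation says little about the population of districts.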

These metaanalytic endeavors point to another problem with the method that
overlaps with the problems in using small samples to estimate populations that
may be complex and highly variable. Once one has postulated that a skewed
distribution may be present, the guiding question becomes the explanatory one:
"Why are there outliers?" Explanation is not a strong point of metaanalysis.
To explain presumes that we have measures of the potential explanatory
constructs for a large sample of studies. Rarely is this the case with
metaanalysis, for their availability depends (1) on the extensive measurement of
what is implemented as part of a treatment (in the desegregation studies
examined, little was available from reports to help with this); and (2) on the
extensive measurement of causal micro-mediating processes. For desegregation
and reading, such measurement might include, but not be limited to, the
assessment of dominant language patterns inside and outside of classrooms. But
the sample size of studies with such measures might be expected to be low since
the relevant hypothesis about language patterns had not been developed when the
earlier evaluators did their work. Indeed, the theory developed because of
their work and the anomalies in the data which the work revealed. Since the
number of studies with adequate measures of potential explanatory variables will
often be low in metaanalysis for reasons of cost and because of the dynamic,
evolving nature of theoretical explanatory constructs, metaanalysis will rarely
result in confident explanation. This was certainly the case in trying to
explain the outliers in Figures 1 through 4. Many potential explanatory forces
were isolated, but none of them could be unconfounded from each other with the
sample sizes and measures on hand.

CONCLUSIONS



My own reading of the panelists' papers and my own analyses lead me to the
following conclusions about how school desegregation has influenced the academic
achievement of black students. The conclusions are based on only about 17
studies, and their generalizability is unknown.



2. On the average, desegregation did not cause an increase in achievement in
mathematics.

3. Desegregation increased mean reading levels. The gain reliably differed
from zero and was estimated to be between two and six weeks across the studies
examined. Only one panelist (Stephan) computed the reading effect per 8 month
school year. His estimate is between five and six weeks of gain per year. But
since none of the studies involved more than three years of post-desegregation
research, it is not possible to compute the mean gain over a child's total
school career in desegregated classrooms.

4. The median gains were almost always greater than zero but were lower than
the means and did not reliably differ from zero. The modal gains were even less
than the median gains and varied around zero.

5. The differences between the means, medians and modes result because the
distribution of reading effects appears to be skewed, with a disproportionate
number of school districts seeming to obtain atypically high gains.

6. Studies with the largest reading gains can be tentatively characterized
along a number of methodological and substantive dimensions, including: small
sample sizes, the study of two or more years of desegregation, desegregated
children who outperformed their segregated counterparts even before
desegregation began, and desegregation that occurred earlier in time, involved
younger students, was voluntary, had larger percentages of whites per school,
and was associated with enrichment programs.

7. None of the above factors can be isolated, singly or in combination, as
causes of any of the atypically large achievement gains in reading that were
obtained in some school districts.

The panel examined only 19 studies of desegregation, with most
panelists rejecting at least two of them on methodological grounds. When the
results for each study (or each comparison) are plotted for reading or
mathematics, the distributions are based on so few observations that I could not
accept the assumption that the obtained distributions closely approximate what
the underlying population distributions are. Because of the small samples and
apparently non-normal distributions, little confidence should be placed in any
of the mean results presented earlier. I have little confidence that we know
much about how desegregation affects reading "on the average" and, across the
few studies examined, I find the variability in effect sizes more striking and
less well understood than any measure of central tendency.

Table 1

Central Tendencies for Reading - Authors' own Preferred Studies

                        Median of      Median of    Midpoint of Modal
              Mean     Comparisons      Studies     Category of Comparisons

Armor          .06         .00            .00         -.05 & +.05
Miller         .16                        .06         -.05 & +.05
Stephan        .14         .08            .08         +.05
Wortman a      .26         .15            .04

a In Wortman's case "preferred" studies refers to those of his selection from the
panel's core 19 for which pretest adjustments could be made. It does not refer
to his analysis of 31 studies.

Table 2 

Central Tendencies for Reading - 17 Common Core Studies

                        Median of      Median of    Midpoint of Modal
              Mean     Comparisons     Studies e    Category of Comparisons

Armor a        .13         .03            0           -.05 & +.05
Miller b       .16                        .06         -.05 & +.05
Stephan c      .13         .07            .08         +.05
Wortman d      .26         .15            .04

a Based on N of comparisons; Carrigan and Thompson & Smidchens omitted;
Rentsch added and given Wortman values.

b Based on N of studies; Carrigan and Thompson & Smidchens omitted.

c Based on N of comparisons; Carrigan and Thompson & Smidchens omitted.
Thus, Iwanicki & Gable and Slone added.

d Based on N of comparisons. The sample size is considerably smaller than
with other analysts, since Wortman omitted all instances where the control group
standard deviation was not specifically given. This resulted in the omission
of Clark, Evans, Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and
Walberg, as well as Carrigan and Thompson & Smidchens. No mode was
ascertainable.

e The medians are from Miller's Table 2 for each author based on N of studies
rather than comparisons.

Table 3

Central Tendencies for ES Values in Math - Authors' own Preferred Studies

                        Median of      Median of    Midpoint of Modal
              Mean     Comparisons      Studies     Category of Comparisons

Armor          .01        -.05           -.06
Miller         .08                        .07
Stephan        .04         .02            .02
Wortman a      .08        -.02           -.05

a In Wortman's case "preferred" studies refers to those of his selection from the
panel's core 19 for which pretest adjustments could be made. It does not refer
to his analysis of 31 studies.

Table 4



Central Tendencies for Reading and Math Combined - Authors' own Preferred Studies

                        Median of      Median of    Midpoint of Modal
              Mean     Comparisons      Studies     Category of Comparisons

Armor          .06         .00            .00         -.05
Miller         .12                        .06         -.15 & +.05
Stephan b      .07         .05            .05         -.05
Wortman a      .16         .08            .01         -.05

a In Wortman's case "preferred" studies refers to those of his selection from the
panel's core 19 for which pretest adjustments could be made. It does not refer
to his analysis of 31 studies.

b These are estimates per school year.

Table 5

Central Tendencies for Reading and Math - 17 Common Core Studies

                        Median of      Median of    Midpoint of Modal
              Mean     Comparisons     Studies e    Category of Comparisons

Armor a        .08         0              0           -.05
Miller b       .12                        .06         -.15 & +.05
Stephan c      .07         .03            .06         +.05
Wortman d      .16         .08            .01         -.05

a Based on N of comparisons; Carrigan and Thompson & Smidchens omitted;
Rentsch added and given Wortman values.

b Based on N of studies; Carrigan and Thompson & Smidchens omitted.

c Based on N of comparisons; Carrigan and Thompson & Smidchens omitted.
Thus, Iwanicki & Gable and Slone added. Estimates of effect per school year.

d Based on N of comparisons. The sample size is considerably smaller than with
other analysts, since Wortman omitted all instances where the control group
standard deviation was not specifically given. This resulted in the omission
of Clark, Evans, Iwanicki & Gable, Klein, Laird & Weeks, Slone, Syracuse, and
Walberg, as well as Carrigan and Thompson & Smidchens.

e The medians are from Miller's Table 2 for each author based on N of studies
rather than comparisons.

Figure 1: Distribution of Reading Effect Sizes in Armor

Figure 2: Distribution of Reading Effect Sizes in Miller

Figure 3: Distribution of Reading Effect Sizes in Stephan

Figure 4: Distribution of Reading and Math Effect Sizes for the
Pretest-Adjusted Studies of Wortman

Figure 5: Relationship between Sample Size and Magnitude of Effect Sizes
Irrespective of their Sign

Figure 6: Relationship between Grade Level at Desegregation and Mean
Effect Size per Eight Months of Desegregation